New York - Population, Venues, Housing Price Analysis

Arun Leo Prakash

18-Jun-2020

Table of Contents

Introduction

New York City (NYC), often called New York (NY), is the most populous city in the United States. With an estimated 2019 population of 8,336,817 distributed over about 302.6 square miles (784 km2), New York is also the most densely populated major city in the United States.

Situated on one of the world's largest natural harbors, New York City is composed of five boroughs, each of which is a county of the State of New York. The five boroughs—Brooklyn, Queens, Manhattan, the Bronx, and Staten Island—were consolidated into a single city in 1898. The city and its metropolitan area constitute the premier gateway for legal immigration to the United States. As many as 800 languages are spoken in New York, making it the most linguistically diverse city in the world. New York is home to more than 3.2 million residents born outside the United States, the largest foreign-born population of any city in the world as of 2016. As of 2019, the New York metropolitan area is estimated to produce a gross metropolitan product (GMP) of $2.0 trillion. If the New York metropolitan area were a sovereign state, it would have the eighth-largest economy in the world. New York is home to the highest number of billionaires of any city in the world.

Business Problem

With a population of over 8 million and a high GDP, New York is a crowded city, and shop owners and social venues gravitate to the districts where the population is dense. Business investors expect low real estate cost, high population density, and little competition from similar venues nearby. Nowadays it is difficult to obtain information that guides investors in this direction.

Considering these problems, we can create a map and information chart in which the real estate index is overlaid on New York and each district is clustered according to its venue density. This would help an investor decide on the ideal location to run a business, based on the factors mentioned above.

Data

The data requirements include: a. spatial data of New York to build maps with borough boundaries, b. the average sale price per square meter for every borough, and c. venue data for the neighborhoods. Venues such as coffee shops and entertainment-related venues, depending on the business requirements, are considered the baseline parameters.

Data Source

  1. Spatial data for New York will be downloaded from https://cocl.us/new_york_dataset/newyork_data.json
  2. Population Data per borough from https://en.wikipedia.org/wiki/New_York_City
  3. Property sales data to collect average sales price from https://www.kaggle.com/new-york-city/nyc-property-sales

Data Acquisition and Cleaning

As a first step, load the New York geospatial data into a dataframe for map generation.

In [2]:
import json  # needed to parse the downloaded geojson file

with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In the section below, the New York housing sales data is loaded and cleansed for use in our comparison.

In [5]:
import pandas as pd  # pandas is needed for all dataframe handling below

df_housing = pd.read_csv('nyc-rolling-sales.csv')
In [6]:
# drop rows with null values, if any
df_housing=df_housing.dropna() 

# check for duplicated rows and drop them if present
print(sum(df_housing.duplicated()))
df_housing.drop_duplicates(inplace=True)
0

There are too many columns; only a few are needed (BOROUGH, SALE PRICE, and either the square-footage columns or YEAR BUILT and TOTAL UNITS). As a first step, let us select the relevant columns into a new dataframe.

In [7]:
df_housing_clean = df_housing[['BOROUGH', 'GROSS SQUARE FEET', 'LAND SQUARE FEET', 'SALE PRICE']]

df_housing_clean = df_housing_clean.rename(columns={'GROSS SQUARE FEET': 'gsqf',
                                                    'LAND SQUARE FEET': 'lsqf',
                                                    'SALE PRICE' : 'sale_price',
                                                    'BOROUGH' : 'borough'})

df_housing_clean["sale_price"] = df_housing_clean["sale_price"].str.strip()
df_housing_clean["gsqf"] = df_housing_clean["gsqf"].str.strip()
df_housing_clean["lsqf"] = df_housing_clean["lsqf"].str.strip()

df_housing_clean = df_housing_clean.query("sale_price != '-'  ")
df_housing_clean = df_housing_clean.query("gsqf != '-'  ")
df_housing_clean = df_housing_clean.query("lsqf != '-'  ")

df_housing_clean['sale_price'] = pd.to_numeric(df_housing_clean['sale_price'], errors='coerce') 
df_housing_clean['gsqf'] = pd.to_numeric(df_housing_clean['gsqf'], errors='coerce')
df_housing_clean['lsqf'] = pd.to_numeric(df_housing_clean['lsqf'], errors='coerce')
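The strip → drop-dash → to_numeric pattern above is worth seeing in isolation; a minimal sketch on a toy series (hypothetical values, not the actual sales data):

```python
import pandas as pd

# toy stand-in for a raw price column from the CSV
s = pd.Series([' 100000 ', '-', ' 250000'])

s = s.str.strip()                      # remove stray whitespace
s = s[s != '-']                        # drop the '-' placeholder rows
s = pd.to_numeric(s, errors='coerce')  # any remaining junk becomes NaN

print(s.tolist())  # → [100000, 250000]
```

The `errors='coerce'` option is a safety net: any value that still fails to parse after the explicit filtering becomes NaN instead of raising.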
In [8]:
df_housing_clean = df_housing[['BOROUGH', 'SALE PRICE', 'YEAR BUILT', 'TOTAL UNITS']]
In [9]:
df_housing_clean[df_housing_clean['YEAR BUILT']==0]['YEAR BUILT'].count()
Out[9]:
6970
In [10]:
df_housing_clean=df_housing_clean[df_housing_clean['YEAR BUILT']!=0].copy()
In [11]:
df_housing_clean[df_housing_clean['TOTAL UNITS']==0]['TOTAL UNITS'].count()
Out[11]:
16864
In [12]:
df_housing_clean=df_housing_clean[df_housing_clean['TOTAL UNITS']!=0].copy()
In [13]:
df_housing_clean = df_housing_clean.rename(columns={'SALE PRICE' : 'sale_price',
                                                    'BOROUGH' : 'borough'})
In [14]:
df_housing_clean.shape
Out[14]:
(60714, 4)
In [15]:
# Mapping from number reference to borough
#Manhattan (1), Bronx (2), Brooklyn (3), Queens (4), and Staten Island (5)
state_abbrev = {1: 'Manhattan',
                2: 'Bronx',
                3: 'Brooklyn',
                4: 'Queens',
                5: 'Staten Island'}

# Function to apply
def abbreviate_borough(housing):
    if housing['borough'] in state_abbrev.keys():
        abbrev = state_abbrev[housing['borough']]
        return abbrev
    else:
        return housing['borough']
    
df_housing_clean['borough'] = df_housing_clean.apply(abbreviate_borough, axis=1)
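A note on the design: the row-wise apply above works, but pandas' Series.map does the same dictionary lookup vectorized. A minimal equivalent sketch on a toy frame (hypothetical values):

```python
import pandas as pd

# toy stand-in for df_housing_clean
df = pd.DataFrame({'borough': [1, 3, 5]})

state_abbrev = {1: 'Manhattan', 2: 'Bronx', 3: 'Brooklyn',
                4: 'Queens', 5: 'Staten Island'}

# map() looks each value up in the dict; values missing from the dict become
# NaN, so fillna restores the original value, matching the apply version above
df['borough'] = df['borough'].map(state_abbrev).fillna(df['borough'])

print(df['borough'].tolist())  # → ['Manhattan', 'Brooklyn', 'Staten Island']
```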

Next step: convert sale_price to numeric.

In [16]:
df_housing_clean["sale_price"] = df_housing_clean["sale_price"].str.strip()
In [17]:
df_housing_clean = df_housing_clean.query("sale_price != '-'  ")
In [18]:
df_housing_clean.shape
Out[18]:
(49128, 4)
In [19]:
df_housing_clean['sale_price'] = pd.to_numeric(df_housing_clean['sale_price'], errors='coerce') 
In [20]:
df_housing_clean[(df_housing_clean['sale_price']<10000) | (df_housing_clean['sale_price']>10000000)]['sale_price'].count() /len(df_housing_clean)
Out[20]:
0.2159664549747598

About 21.6% of the sale prices are either less than $10,000 or greater than $10,000,000. We drop these observations from the data.

In [21]:
df_housing_clean= df_housing_clean[(df_housing_clean['sale_price']>10000) & (df_housing_clean['sale_price']<10000000)].copy()
df_housing_clean.describe()
Out[21]:
sale_price YEAR BUILT TOTAL UNITS
count 3.845700e+04 38457.000000 38457.000000
mean 1.077431e+06 1953.056115 2.191123
std 1.270522e+06 37.838285 17.841976
min 1.000100e+04 1800.000000 1.000000
25% 4.550000e+05 1923.000000 1.000000
50% 6.880000e+05 1941.000000 1.000000
75% 1.110000e+06 1989.000000 2.000000
max 9.999999e+06 2017.000000 2261.000000
In [23]:
# additionally trim the long right tail (prices above $4M) for visualization
df_housing_clean= df_housing_clean[(df_housing_clean['sale_price']<4000000)]
In [24]:
plt.figure(figsize=(12,6))
sns.distplot(df_housing_clean['sale_price'], kde=True, bins=50, rug=True)
plt.show()
In [30]:
plt.figure(figsize=(12,6))
sns.boxplot(y = 'borough', x = 'sale_price', data = df_housing_clean )
plt.title('Box plots for SALE PRICE on each BOROUGH')
plt.show()
In [27]:
plt.figure(figsize=(12,6))
# df_bar (average sale price per borough) is assumed to have been computed
# earlier, e.g. via a groupby('borough').mean(); that cell is not shown here
sns.barplot(y = 'borough', x = 'sale_price', data = df_bar )
plt.title('Average SALE PRICE on each BOROUGH')
plt.show()

Below, population, density and GDP data extracted manually from Wikipedia is loaded.

In [28]:
df_ny_population = pd.read_csv('nyk_population.csv')
In [29]:
df_ny_population = df_ny_population.sort_values(by='population', ascending=True)
In [30]:
df_ny_population
Out[30]:
borough county population gdp_billions gdp_capita sqms sqkm pers_sqms pers_sqkm
4 Staten Island Richmond 476143 14.514 30500 58.37 151.18 8157 3150
0 Bronx Bronx 1418207 42.695 30100 42.10 109.04 33867 13006
2 Manhattan New York 1628706 600.244 368500 22.83 59.13 71341 27544
3 Queens Queens 2253858 93.310 41400 108.53 281.09 20767 8018
1 Brooklyn Kings 2559903 91.559 35800 70.82 183.42 36147 13957

The population/stats data is then merged with the average housing sale price.

In [31]:
df_ny_stats = pd.merge(df_ny_population, df_bar)
In [32]:
df_ny_stats
Out[32]:
borough county population gdp_billions gdp_capita sqms sqkm pers_sqms pers_sqkm sale_price
0 Staten Island Richmond 476143 14.514 30500 58.37 151.18 8157 3150 5.067751e+05
1 Bronx Bronx 1418207 42.695 30100 42.10 109.04 33867 13006 5.579480e+05
2 Manhattan New York 1628706 600.244 368500 22.83 59.13 71341 27544 1.571737e+06
3 Queens Queens 2253858 93.310 41400 108.53 281.09 20767 8018 7.098848e+05
4 Brooklyn Kings 2559903 91.559 35800 70.82 183.42 36147 13957 1.038700e+06

Data Exploration

In [33]:
fig, ax = plt.subplots(figsize=(12, 12), nrows=2, ncols=2)


#first graph - bar plot highlighting avg sale price borough level
g1 = ax[0, 0].bar(data=df_bar, x='borough', height='sale_price');

#second graph pie chart for population
data = df_ny_stats['population']
label = df_ny_stats['borough']
g2= ax[0, 1].pie(data, autopct='%1.1f%%', shadow=True, startangle=90, labels=label);

#third graph pie chart for gdp
data = df_ny_stats['gdp_capita']
label = df_ny_stats['borough']
g3= ax[1,0].pie(data, autopct='%1.1f%%', shadow=True, startangle=90, labels=label);

#fourth graph - bar plot highlighting persons per sq km at borough level
g4 = ax[1,1].bar(data=df_ny_stats, x='borough', height='pers_sqkm');

ax[0, 0].title.set_text('Average SALE PRICE on each BOROUGH')
ax[0, 1].title.set_text('Population')
ax[1, 0].title.set_text('GDP per capita')
ax[1, 1].title.set_text('Persons / sq km on each BOROUGH')


plt.suptitle('Comparison of GDP, Population, Sale Price', y=1.03, fontsize=14, fontweight='semibold');

Summarizing the comparison (each in descending order):

  • GDP: Manhattan is the major contributor, followed by Queens, Brooklyn, the Bronx and Staten Island.
  • Average sale price: Manhattan, Brooklyn, Queens, the Bronx, Staten Island.
  • Persons per sq km: Manhattan, Brooklyn, the Bronx, Queens, Staten Island.
  • Population: Brooklyn, Queens, Manhattan, the Bronx, Staten Island.

I will perform neighborhood exploration for Queens, considering the factors mentioned in the problem description: lower real estate cost, high population, and good GDP. This helps lower the investment while promising better returns. In the next section, let us explore the neighborhood data, followed by segmentation to find the top venues.

Methodology

The ‘Search Nearby’ option in Google Maps was used to get the center coordinates of each borough, and the population data was captured from the Wikipedia page. We will use the Foursquare API for neighborhood data exploration, focusing on shopping venues, coffee shops, and entertainment venues. Based on the comparison above, Queens is chosen for its combination of high population, good GDP, and low average sale price. We then apply neighborhood segmentation and clustering to analyze the neighborhood data and prioritize the best locations based on the density and type of venues available. This helps the investor choose the best place for a business investment.

I use the Folium library to render the New York map.

In [34]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))
The geographical coordinates of New York City are 40.7127281, -74.0060152.
In [35]:
empty_map = folium.Map(location=[latitude, longitude], zoom_start=10)
In [36]:
map_newyork = empty_map
In [37]:
df_ny_stats = df_ny_stats.rename(columns={'borough' : 'Borough'})
In [38]:
# create map of New York using latitude and longitude values
with open('nyborough.geojson') as json_data:
    nyborough_data = json.load(json_data)

# choropleth layer: boroughs shaded by average house sale price
map_newyork.choropleth(
    geo_data=nyborough_data,
    data=df_ny_stats,
    columns=['Borough', 'sale_price'],
    key_on='feature.properties.BoroName',
    fill_color='YlOrRd',
    fill_opacity=0.5,
    line_opacity=0.2,
    legend_name='House Price in NewYork'
)

map_newyork
Out[38]:
[Interactive map: choropleth of New York boroughs shaded by average house sale price]

Neighborhoods exploration

We now start utilizing the Foursquare API to explore the neighborhoods in order to segment them.

Define Foursquare Credentials and Version

Results are limited to 100 venues within a radius of 750 meters.

In [40]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 750 # define radius

A getNearbyVenues method is created so the code can be reused for any borough.
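The getNearbyVenues helper itself is not shown in this notebook. Below is a minimal sketch of what such a method might look like, assuming the Foursquare v2 'explore' response shape; the fetch parameter (which in a real implementation would wrap a requests.get call on the explore endpoint with CLIENT_ID/CLIENT_SECRET) is injected here so the flattening logic can run without network access. All field names are assumptions based on the standard Foursquare response, not code from the original notebook.

```python
import pandas as pd

def getNearbyVenues(names, latitudes, longitudes, radius=750, limit=100, fetch=None):
    """Collect nearby venues for each neighborhood into one dataframe.

    fetch(lat, lng, radius, limit) must return the groups[0]['items'] list
    of a Foursquare v2 explore response. A real implementation would wrap
    requests.get on the explore endpoint with CLIENT_ID / CLIENT_SECRET.
    """
    rows = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)  # progress line per neighborhood
        for item in fetch(lat, lng, radius, limit):
            v = item['venue']
            rows.append((name, lat, lng, v['name'],
                         v['location']['lat'], v['location']['lng'],
                         v['categories'][0]['name']))
    return pd.DataFrame(rows, columns=['Neighborhood', 'Neighborhood Latitude',
                                       'Neighborhood Longitude', 'Venue',
                                       'Venue Latitude', 'Venue Longitude',
                                       'Venue Category'])
```

Injecting the fetch function also makes the helper easy to unit-test with a stubbed response.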

In [42]:
neighborhoods = neighborhoods[neighborhoods['Borough'] == 'Queens']
In [75]:
venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude'],
                                   radius=radius
                                  )
Astoria
Woodside
Jackson Heights
Elmhurst
Howard Beach
Corona
Forest Hills
Kew Gardens
Richmond Hill
Flushing
Long Island City
Sunnyside
East Elmhurst
Maspeth
Ridgewood
Glendale
Rego Park
Woodhaven
Ozone Park
South Ozone Park
College Point
Whitestone
Bayside
Auburndale
Little Neck
Douglaston
Glen Oaks
Bellerose
Kew Gardens Hills
Fresh Meadows
Briarwood
Jamaica Center
Oakland Gardens
Queens Village
Hollis
South Jamaica
St. Albans
Rochdale
Springfield Gardens
Cambria Heights
Rosedale
Far Rockaway
Broad Channel
Breezy Point
Steinway
Beechhurst
Bay Terrace
Edgemere
Arverne
Rockaway Beach
Neponsit
Murray Hill
Floral Park
Holliswood
Jamaica Estates
Queensboro Hill
Hillcrest
Ravenswood
Lindenwood
Laurelton
Lefrak City
Belle Harbor
Rockaway Park
Somerville
Brookville
Bellaire
North Corona
Forest Hills Gardens
Jamaica Hills
Utopia
Pomonok
Astoria Heights
Hunters Point
Sunnyside Gardens
Blissville
Roxbury
Middle Village
Malba
Hammels
Bayswater
Queensbridge
In [44]:
print('There are {} unique categories.'.format(len(venues['Venue Category'].unique())))
There are 315 unique categories.
In [45]:
# one hot encoding
onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
onehot['Neighborhood'] = venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]

onehot.shape
Out[45]:
(3785, 315)
In [46]:
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped.shape)
(81, 315)
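To make the one-hot + groupby-mean step concrete: each cell of grouped is the fraction of a neighborhood's venues that fall in a category. A toy illustration (hypothetical values):

```python
import pandas as pd

# toy venue list: two neighborhoods, three venues
venues = pd.DataFrame({'Neighborhood': ['A', 'A', 'B'],
                       'Venue Category': ['Cafe', 'Cafe', 'Park']})

# one-hot encode the category, re-attach the neighborhood, then average:
# each cell becomes the share of that neighborhood's venues in a category
onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")
onehot['Neighborhood'] = venues['Neighborhood']
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

Here neighborhood A ends up with a Cafe frequency of 1.0 and B with a Park frequency of 1.0, which is exactly the feature representation fed to the clustering below.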
In [47]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
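The helper can be exercised on a toy frequency row (a self-contained copy of the function is included so the snippet runs on its own; the values are hypothetical):

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # skip the Neighborhood label at position 0, then sort by frequency
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# toy row: neighborhood label followed by venue-category frequencies
row = pd.Series({'Neighborhood': 'Toyville', 'Bar': 0.1, 'Cafe': 0.5, 'Park': 0.4})
print(return_most_common_venues(row, 2))  # → ['Cafe' 'Park']
```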
In [48]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = grouped['Neighborhood']

for ind in np.arange(grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
Out[48]:
Neighborhood 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue 6th Most Common Venue 7th Most Common Venue 8th Most Common Venue 9th Most Common Venue 10th Most Common Venue
0 Arverne Surf Spot Beach Deli / Bodega Sandwich Place Bus Stop Donut Shop Furniture / Home Store Café Bed & Breakfast Gas Station
1 Astoria Bar Greek Restaurant Middle Eastern Restaurant Pizza Place Seafood Restaurant Coffee Shop Hookah Bar Grocery Store Sandwich Place Indian Restaurant
2 Astoria Heights Rental Car Location Bakery Bus Station Pizza Place Café Italian Restaurant Hotel Laundromat Baseball Field Chinese Restaurant
3 Auburndale Korean Restaurant Cosmetics Shop Pizza Place Sushi Restaurant Greek Restaurant Sandwich Place Bank Pet Store Pharmacy Gym / Fitness Center
4 Bay Terrace Clothing Store Cosmetics Shop Mobile Phone Shop Women's Store Kids Store Lingerie Store American Restaurant Donut Shop Shoe Store Men's Store

Segmentation (Clustering)

In [54]:
# set number of clusters
kclusters = 6

# run k-means clustering on the venue-frequency features (Neighborhood column dropped)
grouped_clustering = grouped.drop('Neighborhood', axis=1)
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
Out[54]:
array([1, 0, 0, 0, 0, 0, 2, 0, 3, 1])
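The notebook fixes kclusters = 6 without further motivation. A common sanity check is the elbow method: compute the k-means inertia for a range of k and look for the bend where it stops dropping sharply. A sketch on synthetic blobs (not the notebook's actual venue features):

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic stand-in for grouped_clustering: three well-separated 2-D blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(30, 2)) for c in (0.0, 1.0, 2.0)])

# inertia (within-cluster sum of squares) for a range of k; the "elbow"
# where it stops dropping sharply suggests a reasonable k (here, 3)
inertias = [KMeans(n_clusters=k, random_state=0, n_init=10).fit(X).inertia_
            for k in range(1, 7)]
print([round(i, 2) for i in inertias])
```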
In [55]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# merge the sorted venue lists with the neighborhood coordinates
neighborhoods_merged = pd.merge(neighborhoods, neighborhoods_venues_sorted, on='Neighborhood', how='inner')

# attach the borough-level stats (population, GDP, average sale price)
neighborhoods_merged = neighborhoods_merged.merge(df_ny_stats, on='Borough')

neighborhoods_merged.head() # check the last columns!
Out[55]:
Borough Neighborhood Latitude Longitude Cluster Labels 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue 6th Most Common Venue 7th Most Common Venue 8th Most Common Venue 9th Most Common Venue 10th Most Common Venue county population gdp_billions gdp_capita sqms sqkm pers_sqms pers_sqkm sale_price
0 Queens Astoria 40.768509 -73.915654 0 Bar Greek Restaurant Middle Eastern Restaurant Pizza Place Seafood Restaurant Coffee Shop Hookah Bar Grocery Store Sandwich Place Indian Restaurant Queens 2253858 93.31 41400 108.53 281.09 20767 8018 709884.787686
1 Queens Woodside 40.746349 -73.901842 0 Thai Restaurant Pizza Place Bar Grocery Store Bakery Latin American Restaurant American Restaurant Pub Filipino Restaurant Donut Shop Queens 2253858 93.31 41400 108.53 281.09 20767 8018 709884.787686
2 Queens Jackson Heights 40.751981 -73.882821 0 Latin American Restaurant South American Restaurant Bakery Mexican Restaurant Peruvian Restaurant Pizza Place Thai Restaurant Grocery Store Coffee Shop Donut Shop Queens 2253858 93.31 41400 108.53 281.09 20767 8018 709884.787686
3 Queens Elmhurst 40.744049 -73.881656 0 Thai Restaurant Mexican Restaurant Chinese Restaurant Bakery South American Restaurant Latin American Restaurant Grocery Store Supermarket Bubble Tea Shop Snack Place Queens 2253858 93.31 41400 108.53 281.09 20767 8018 709884.787686
4 Queens Howard Beach 40.654225 -73.838138 0 Italian Restaurant Bagel Shop Pharmacy Fast Food Restaurant Sandwich Place Ice Cream Shop Park Sushi Restaurant Other Nightlife Bus Station Queens 2253858 93.31 41400 108.53 281.09 20767 8018 709884.787686
In [56]:
neighborhoods_c1 = neighborhoods_merged[neighborhoods_merged['Cluster Labels'] == 0][['Neighborhood', '1st Most Common Venue']].groupby('1st Most Common Venue').count().sort_values (by=['Neighborhood'], ascending=False).head(10).reset_index()
neighborhoods_c1["Cluster"] = 1

neighborhoods_c2 = neighborhoods_merged[neighborhoods_merged['Cluster Labels'] == 1][['Neighborhood', '1st Most Common Venue']].groupby('1st Most Common Venue').count().sort_values (by=['Neighborhood'], ascending=False).head(10).reset_index()
neighborhoods_c2["Cluster"] = 2

neighborhoods_c3 = neighborhoods_merged[neighborhoods_merged['Cluster Labels'] == 2][['Neighborhood', '1st Most Common Venue']].groupby('1st Most Common Venue').count().sort_values (by=['Neighborhood'], ascending=False).head(10).reset_index()
neighborhoods_c3["Cluster"] = 3

neighborhoods_c4 = neighborhoods_merged[neighborhoods_merged['Cluster Labels'] == 3][['Neighborhood', '1st Most Common Venue']].groupby('1st Most Common Venue').count().sort_values (by=['Neighborhood'], ascending=False).head(10).reset_index()
neighborhoods_c4["Cluster"] = 4

neighborhoods_c5 = neighborhoods_merged[neighborhoods_merged['Cluster Labels'] == 4][['Neighborhood', '1st Most Common Venue']].groupby('1st Most Common Venue').count().sort_values (by=['Neighborhood'], ascending=False).head(10).reset_index()
neighborhoods_c5["Cluster"] = 5

neighborhoods_c6 = neighborhoods_merged[neighborhoods_merged['Cluster Labels'] == 5][['Neighborhood', '1st Most Common Venue']].groupby('1st Most Common Venue').count().sort_values (by=['Neighborhood'], ascending=False).head(10).reset_index()
neighborhoods_c6["Cluster"] = 6

df_tmp = pd.concat([neighborhoods_c1, neighborhoods_c2])
df_tmp = pd.concat([df_tmp, neighborhoods_c3])
df_tmp = pd.concat([df_tmp, neighborhoods_c4])
df_tmp = pd.concat([df_tmp, neighborhoods_c5])
df_tmp = pd.concat([df_tmp, neighborhoods_c6])

df_final = df_tmp.sort_values(by=['Cluster'], ascending=True).reset_index(drop=True)
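The six near-identical blocks above can be collapsed into one loop. A sketch, assuming the same column names as above (the toy frame and the function name top_first_venues_by_cluster are illustrative, not from the notebook):

```python
import pandas as pd

def top_first_venues_by_cluster(df, kclusters, top_n=10):
    """Equivalent of the six copy-pasted blocks above in one loop:
    for each cluster, count how often each '1st Most Common Venue'
    category appears and keep the top_n categories."""
    parts = []
    for k in range(kclusters):
        part = (df[df['Cluster Labels'] == k][['Neighborhood', '1st Most Common Venue']]
                .groupby('1st Most Common Venue').count()
                .sort_values(by='Neighborhood', ascending=False)
                .head(top_n).reset_index())
        part['Cluster'] = k + 1  # 1-based cluster numbering, as above
        parts.append(part)
    return pd.concat(parts, ignore_index=True)

# toy example with hypothetical values
toy = pd.DataFrame({'Neighborhood': ['A', 'B', 'C', 'D'],
                    '1st Most Common Venue': ['Cafe', 'Cafe', 'Park', 'Bar'],
                    'Cluster Labels': [0, 0, 0, 1]})
print(top_first_venues_by_cluster(toy, kclusters=2))
```

Beyond being shorter, the loop version cannot drift out of sync between clusters if the grouping logic changes.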

Results

A map is plotted to visualize the neighborhoods by cluster, and a choropleth layer highlights the boroughs by their average house sale price.

In [57]:
from folium import ColorMap, LinearColormap, StepColormap

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10.5)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# choropleth layer: boroughs shaded by average house sale price
with open('nyborough.geojson') as json_data:
    nyborough_data = json.load(json_data)

map_clusters.choropleth(
    geo_data=nyborough_data,
    data=df_ny_stats,
    columns=['Borough', 'sale_price'],
    key_on='feature.properties.BoroName',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='House Price in NewYork'
)

# add a circle marker per neighborhood, colored by cluster

for cluster, lat, lon, poi, borough, avg_hp in zip(
                                  neighborhoods_merged['Cluster Labels'], 
                                  neighborhoods_merged['Latitude'], 
                                  neighborhoods_merged['Longitude'], 
                                  neighborhoods_merged['Neighborhood'], 
                                  neighborhoods_merged['Borough'],
                                  neighborhoods_merged['sale_price']):
    label = folium.Popup('cluster ' + str(cluster) + ' - ' + str(poi) + ' : avg price ' + str(int(avg_hp)), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters) 

map_clusters
Out[57]:
[Interactive map: neighborhood cluster markers over the borough house-price choropleth]

For cluster visualization, bar graphs are developed to show which venue categories dominate each cluster.

In [58]:
plt.figure(figsize=(8,12))
plt.suptitle('Cluster Analysis', y=1.03, fontsize=16, fontweight='semibold');

plt.subplot(3, 2, 1);
g = sns.barplot(data=neighborhoods_c1, x="1st Most Common Venue", y = 'Neighborhood');
g.set_title('Cluster 1 (0)', y=1.03, fontsize=14, fontweight='semibold');
g.set_xlabel('Venues');
g.set_ylabel('No of occurences');

plt.xticks(rotation=90)

plt.subplot(3, 2, 2);
g = sns.barplot(data=neighborhoods_c2, x="1st Most Common Venue", y = 'Neighborhood');
g.set_title('Cluster 2 (1)', y=1.03, fontsize=14, fontweight='semibold');
g.set_xlabel('Venues');
g.set_ylabel('No of occurences');

plt.xticks(rotation=90)

# code for the pie chart
plt.subplot(3, 2, 3);
g = sns.barplot(data=neighborhoods_c3, x="1st Most Common Venue", y = 'Neighborhood');
g.set_title('Cluster 3 (2)', y=1.03, fontsize=14, fontweight='semibold');
g.set_xlabel('Venues');
g.set_ylabel('No of occurences');

plt.xticks(rotation=90)

plt.suptitle('Cluster Analysis', y=1.03, fontsize=16, fontweight='semibold');
base_color = sns.color_palette()[0]
# code for the bar chart
plt.subplot(3, 2, 4);
g = sns.barplot(data=neighborhoods_c4, x="1st Most Common Venue", y = 'Neighborhood');
g.set_title('Cluster 4 (3)', y=1.03, fontsize=14, fontweight='semibold');
g.set_xlabel('Venues');
g.set_ylabel('No of occurences');

plt.xticks(rotation=90)

plt.subplot(3, 2, 5);
g = sns.barplot(data=neighborhoods_c5, x="1st Most Common Venue", y = 'Neighborhood');
g.set_title('Cluster 5 (4)', y=1.03, fontsize=14, fontweight='semibold');
g.set_xlabel('Venues');
g.set_ylabel('No of occurences');

plt.xticks(rotation=90)

# code for the pie chart
plt.subplot(3, 2, 6);
g = sns.barplot(data=neighborhoods_c6, x="1st Most Common Venue", y = 'Neighborhood');
g.set_title('Cluster 6 (5)', y=1.03, fontsize=14, fontweight='semibold');
g.set_xlabel('Venues');
g.set_ylabel('No of occurences');

plt.xticks(rotation=90)

plt.tight_layout()
plt.show()

Clusters are categorized as follows

  • Cluster 1 - Food/Clothing.
  • Cluster 2 - Beaches/sports
  • Cluster 3 - Sports
  • Cluster 4 - Food/Harbor
  • Cluster 5 - Beach
  • Cluster 6 - Park

The data exploration results help in understanding population density, GDP and housing prices, so the boroughs can be compared and one chosen for the clustering analysis. From the clustering results, an investor can pick the cluster matching the type of business they are investing in, and the map helps in selecting the area where the business can be built.

Discussion

As mentioned in the introduction, New York is one of the most popular cities in the US and has a very high GDP. The city has five boroughs, and for this project I considered Queens based on its low real estate pricing, good population density, and good GDP. The neighborhood data is filtered to the Queens borough, and the venues for these neighborhoods are collected using the Foursquare API. I then kept only the top 10 venue categories per neighborhood, based on the frequency of each category within the neighborhood. I used k-means to group the neighborhoods into 6 clusters, and the dominant venues in each cluster help define its category; using this categorization, a map was generated to display each neighborhood with its respective category. The boroughs are also shaded by their housing price index, so that where each borough stands can be understood visually.

Conclusion

In New York, Manhattan is the most expensive borough, with a mid-range population compared with its peer boroughs; Manhattan also contributes roughly 72% of the city's GDP. Business investors with a high budget can opt for Manhattan. Queens, with its high population and low average housing cost, could be a better investment option. Clusters 1 and 4, with their high number of restaurant venues, should be good places for investing in related businesses that support restaurants and bars, e.g. food supplies.

Further analysis can be done for other boroughs and cities depending on the investor's needs, and this work could be enhanced to support government institutions in improving planning efficiency.

References

  • https://en.wikipedia.org/wiki/New_York_City
  • https://www.kaggle.com/new-york-city/nyc-property-sales
  • https://geo.nyu.edu/catalog/nyu_2451_34572
  • https://developer.foursquare.com/
  • https://www.google.com/maps/